{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes\n",
    "$$ \\begin{split} \\mathop{argmax}_{c_k}p(y=c_k|x) &= \\mathop{argmax}_{c_k}p(y=c_k)p(x|y=c_k) \\\\\n",
    "& \\left( due to: p(y=c_k|x) = \\frac{p(y=c_k)p(x|y=c_k)}{p(x)} \\right) \\\\\n",
    "&= \\mathop{argmax}_{c_k}p(y=c_k)\\prod_jp(x^{(j)}|y=c_k) \\end{split} $$\n",
    "Use Maximum Likelihood Estimate(MLE) to evaluate $ p(y=c_k)$ and $ p(x^{(j)}|y=c_k) $ in datasets.\n",
    "$$ \\hat{p}(y=c_k) = \\frac{\\sum_i I(y_i=c_k)}{N} \\\\\n",
    "\\hat{p}(x^{(j)}=a_j|y=c_k) = \\frac{\\sum_i I(x_i^{(j)}=a_j,y=c_k)}{I(y_i=c_k)}\n",
    "$$\n",
    "Bayesian estimation add $ \\lambda $ on numerator and denominator in MLE."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes in Scikit-learn\n",
    "Classifiers: GaussianNB, MultinomialNB, BernoulliNB\n",
    "\n",
    "## Documents Classification\n",
    "Use TF-IDF(Term Frequency and Inverse Document Frequency) of term in documents as feature\n",
    "$$ TF-IDF = TF*IDF \\\\\n",
    "TF(t) = \\frac {\\text{Number of times term t appears in a document}}{\\text{Total number of terms in the document}}\\\\\n",
    "IDF(t) = log_e\\frac {\\text{Total number of documents}}{\\text{Number of documents with term t in it + 1}} $$\n",
    "Bag of Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TfidfVectorizer\n",
    "sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stop_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "vect = TfidfVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents=[\n",
    "    'my dog has flea problems help please',\n",
    "    'maybe not take him to dog park stupid',\n",
    "    'my dalmation is so cute I love him',\n",
    "    'stop posting stupid worthless garbage',\n",
    "    'mr licks ate my steak how to stop him',\n",
    "    'quit buying worthlsess dog food stupid',\n",
    "]\n",
    "targets=[0,1,0,1,0,1] # 0 normal, 1 insult"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf_matrix = vect.fit_transform(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless', 'worthlsess']\n"
     ]
    }
   ],
   "source": [
    "print(vect.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'my': 17, 'dog': 4, 'has': 8, 'flea': 5, 'problems': 22, 'help': 9, 'please': 20, 'maybe': 15, 'not': 18, 'take': 28, 'him': 10, 'to': 29, 'park': 19, 'stupid': 27, 'dalmation': 3, 'is': 12, 'so': 24, 'cute': 2, 'love': 14, 'stop': 26, 'posting': 21, 'worthless': 30, 'garbage': 7, 'mr': 16, 'licks': 13, 'ate': 0, 'steak': 25, 'how': 11, 'quit': 23, 'buying': 1, 'worthlsess': 31, 'food': 6}\n"
     ]
    }
   ],
   "source": [
    "print(vect.vocabulary_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.        , 0.        , 0.        , 0.        , 0.2836157 ,\n",
       "        0.40966432, 0.        , 0.        , 0.40966432, 0.40966432,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.2836157 , 0.        , 0.        ,\n",
       "        0.40966432, 0.        , 0.40966432, 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        ],\n",
       "       [0.        , 0.        , 0.        , 0.        , 0.28007245,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.28007245, 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.40454634, 0.        , 0.        , 0.40454634, 0.40454634,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.28007245, 0.40454634, 0.33173378,\n",
       "        0.        , 0.        ],\n",
       "       [0.        , 0.        , 0.40966432, 0.40966432, 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.2836157 , 0.        , 0.40966432, 0.        , 0.40966432,\n",
       "        0.        , 0.        , 0.2836157 , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.40966432,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        ],\n",
       "       [0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.490779  , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.490779  , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.4024458 , 0.3397724 , 0.        , 0.        ,\n",
       "        0.490779  , 0.        ],\n",
       "       [0.37002943, 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.25617597, 0.37002943, 0.        , 0.37002943, 0.        ,\n",
       "        0.        , 0.37002943, 0.25617597, 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.37002943, 0.30342943, 0.        , 0.        , 0.30342943,\n",
       "        0.        , 0.        ],\n",
       "       [0.        , 0.44907696, 0.        , 0.        , 0.31090155,\n",
       "        0.        , 0.44907696, 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
       "        0.        , 0.        , 0.        , 0.44907696, 0.        ,\n",
       "        0.        , 0.        , 0.31090155, 0.        , 0.        ,\n",
       "        0.        , 0.44907696]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf_matrix.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_vect = CountVectorizer()\n",
    "c_matrix = c_vect.fit_transform(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless', 'worthlsess']\n"
     ]
    }
   ],
   "source": [
    "print(c_vect.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,\n",
       "        1, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
       "       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,\n",
       "        0, 0, 0, 0, 0, 1, 1, 1, 0, 0],\n",
       "       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,\n",
       "        0, 0, 1, 0, 0, 0, 0, 0, 0, 0],\n",
       "       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,\n",
       "        0, 0, 0, 0, 1, 1, 0, 0, 1, 0],\n",
       "       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,\n",
       "        0, 0, 0, 1, 1, 0, 0, 1, 0, 0],\n",
       "       [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "        0, 1, 0, 0, 0, 1, 0, 0, 0, 1]], dtype=int64)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_matrix.toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ate', 'ate my', 'buying', 'buying worthlsess', 'cute', 'cute love', 'dalmation', 'dalmation is', 'dog', 'dog food', 'dog has', 'dog park', 'flea', 'flea problems', 'food', 'food stupid', 'garbage', 'has', 'has flea', 'help', 'help please', 'him', 'him to', 'how', 'how to', 'is', 'is so', 'licks', 'licks ate', 'love', 'love him', 'maybe', 'maybe not', 'mr', 'mr licks', 'my', 'my dalmation', 'my dog', 'my steak', 'not', 'not take', 'park', 'park stupid', 'please', 'posting', 'posting stupid', 'problems', 'problems help', 'quit', 'quit buying', 'so', 'so cute', 'steak', 'steak how', 'stop', 'stop him', 'stop posting', 'stupid', 'stupid worthless', 'take', 'take him', 'to', 'to dog', 'to stop', 'worthless', 'worthless garbage', 'worthlsess', 'worthlsess dog']\n"
     ]
    }
   ],
   "source": [
    "# default ngram_range is (1, 1), token_pattern=’(?u)\\b\\w\\w+\\b’\n",
    "c_vect_ngram = CountVectorizer(ngram_range=(1, 2))\n",
    "c_matrix_ngram = c_vect_ngram.fit_transform(documents)\n",
    "print(c_vect_ngram.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
